In [ ]:
import sys
sys.path.append('../audio_processing/')
import audio_processing as ap
import librosa
import IPython.display as ipd 
import numpy as np
import musdb
import matplotlib.pyplot as plt
import residual as res

Residual Analysis¶

By Coby Wilcox (4812764) PSTAT 174

In this report, I'll examine the errors of my generated sources and try to answer questions like:

Where is the error mainly coming from? What are the different types of error in audio processing? How do these types of error sound, appear visually, and behave mathematically? Are there ways to address them?

To start, let's look at the mask I created in the spectrogram limits report.

In [ ]:
stems, track, sr = res.load_waltz()

Visual Analysis¶

This seems like the natural place to start. If we represent these audio files visually, in spectrogram or waveform form, do we see much difference between the two? Can we zoom in to find more differences, or are the differences apparent enough that this isn't necessary?

Here the residuals are mapped onto one visual, so you can see exactly what belongs to the target source, what belongs to the estimated source, and what they share.

In [ ]:
spec_data, masks = res.make_masks(stems)

Interpreting these masks can be a bit challenging, as it's hard to make out exactly what is going on in these spectrograms. I think the best one to start with is the bass spectrogram. In the target, we can see that all the sound comes from the low frequencies, which is to be expected from the bass.

The mask spectrograms are best seen as weight matrices with only neutral or negative weights. Remember, these are always applied to the mix spectrogram shown in the top left corner. Blue means the values aren't affected, and red means the values are negatively weighted. So the bass mask will mostly remove audio signals corresponding to high frequencies while preserving most of the low frequencies.

Mathematical Understanding¶

I think to understand the masks a little more, it can help to remind yourself how they are calculated. The masks are made using the target spectrograms themselves, which is typically impossible in practice, but is okay for the sake of demonstration. They are calculated using a simple nonlinear formula where the weight equals the fraction the target makes up of the overall mix signal.

$$m_i = \frac{s_i}{\max(s, s_i)}, \quad s=\text{Mix STFT Representation}, \quad s_i=\text{Target STFT Representation}$$

So if the target signal is more prevalent at a frequency than the mix, the mask does not affect that frequency. If the mix is more prevalent, the mix is multiplied by the fraction the target makes up of that frequency. You may also be confused by the appearance of negative decibels in the colorbar. Generally, negative decibels are sounds with magnitudes below the human ability to perceive them, but in this case they can actually explain quite elegantly what the mask does. Human beings hear on a $log_{10}$ scale, which is what decibels represent: if a sound has ten times more magnitude, we hear it as only twice as loud. And a nice property of logs you may remember is:

$$log(a/b) = log(a)-log(b)$$

Which is exactly what we are doing in these representations: we divide the target source by the overall mix, so the masks essentially subtract the right amount of decibels from the mix itself.
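To make the mask arithmetic concrete, here is a small NumPy sketch, with toy random magnitudes standing in for the real STFTs (not the report's actual `res.make_masks` code), that builds the mask from the formula above and checks the decibel-subtraction view:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms (frequency bins x time frames); in the report
# these come from STFTs of the mix and of the target stem.
mix_mag = rng.random((5, 4)) + 1.0          # |s|, mix magnitude
target_mag = mix_mag * rng.random((5, 4))   # |s_i|, target magnitude

# m_i = s_i / max(s, s_i), element-wise: every weight lies in [0, 1],
# so the mask can only attenuate, never amplify.
mask = target_mag / np.maximum(mix_mag, target_mag)
assert mask.min() >= 0.0 and mask.max() <= 1.0

# Applying the mask scales the mix back down to the target's share.
est_mag = mask * mix_mag

# In decibels, multiplying by the mask is the same as subtracting dB:
# 20*log10(m * |s|) = 20*log10(|s|) + 20*log10(m).
db = lambda x: 20 * np.log10(x)
assert np.allclose(db(est_mag), db(mix_mag) + db(mask))
```

Since the toy target never exceeds the mix here, the masked mix recovers the target magnitude exactly; with real overlapping stems the recovery is only approximate, which is where the residuals come from.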

In [ ]:
s_hat = res.apply_masks(spec_data, masks)

Looks pretty good! Almost identical; however, if we graph the residuals, hopefully the differences will become a little more apparent.

In [ ]:
residuals = res.plot_residuals_spec(spec_data, s_hat)

Looking at this, the differences are much more apparent. There are a couple of things to note.

  • The errors generally seem very sporadic and almost random.
  • The errors do appear to possibly happen more often in very contested frequency/time ranges.
In [ ]:
est_signals, true_signals = res.playback_sources(spec_data, s_hat)
True Signal: Drums
Estimated Signal: Drums
True Signal: Bass
Estimated Signal: Bass
True Signal: Other
Estimated Signal: Other
True Signal: Vocals
Estimated Signal: Vocals

Listening to the generated sources, they are nearly identical to the exact sources. But there are little bits of distortion and artifacts within the new signals, and they can be hard to hear. The most easily perceptible is the vocals, as our ears are more sensitive to changes in people's voices than in instrument sounds. Most likely these are the little bits of noise that we found in the residual analysis.

So, where are these pieces of noise coming from exactly? These estimations should have little to no error, as they are generated using the actual source signals themselves. I'll try to answer this question in the next sections.

In [ ]:
res.plot_audio_signals(est_signals, true_signals)
In [ ]:
res.display_csd(est_signals, true_signals)
In [ ]:
res_signals = res.plot_residual_signals(est_signals, true_signals)
In [ ]:
sources = ['Drums','Bass','Other','Vocals']
for i,resid in enumerate(res_signals):
    print('Residual:', sources[i])
    ipd.display(ipd.Audio(resid, rate=sr))
Residual: Drums
Residual: Bass
Residual: Other
Residual: Vocals

Looking at these residuals, what you realize is that they are just a tiny sliver of the actual target audio that didn't make it in. This makes sense when you remember how the mask worked for the estimated sources. All it does is subtract loudness from specific frequencies in the audio. So on frequencies where multiple sources sit at the same level, some loudness will be subtracted, and some of it may belong to the target source by accident. There is another way we can understand these issues as well: by looking at the phase of the mixture.

When reviewing the method of the separation itself, I recalled how, in especially contested frequencies, the mask simply applies a proportional decrease in volume, which may be the main source of the residuals. To check this hypothesis, I wanted to see how much the frequencies of one source's residuals correlate with those of the other sources. To examine this, I first graphed the periodogram of each source's residuals to see which frequencies were most present. A periodogram is a time series tool that takes the Fourier transform of the frames of a signal and estimates the contribution of each frequency. Mathematically, we define a periodogram as:

$$|d(\omega_j)|^2 = |\frac{1}{\sqrt{n}}\sum_{t=1}^{n}{x_t \cdot e^{-2\pi i\omega_j t}}|^2$$

Where $x_t$ are the $n$ samples of the (windowed, overlapping) audio signal frames, and $\omega_j = j/n$ are the Fourier frequencies. Or, equivalently:

$$I(\omega_j) = \sum_{h=-(n-1)}^{n-1}{\hat{\gamma}(h) \cdot e^{-2\pi i\omega_j h}}$$

Where $\hat{\gamma}(h)$ is the sample autocovariance, and $\omega_j$ are the frequencies.
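As a sanity check of the first formula, here is a hypothetical NumPy sketch: the periodogram is just $|\text{FFT}(x)|^2/n$, and its largest ordinate lands on the frequency of a sinusoid buried in noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
t = np.arange(n)

# A sinusoid at 20 cycles per n samples, buried in white noise.
x = np.sin(2 * np.pi * 20 * t / n) + 0.3 * rng.standard_normal(n)

# Periodogram: I(omega_j) = |FFT(x)|^2 / n at frequencies omega_j = j/n.
I = np.abs(np.fft.rfft(x)) ** 2 / n
freqs = np.fft.rfftfreq(n)  # frequencies in cycles per sample

# The largest ordinate (ignoring the DC bin) sits at the sinusoid's frequency.
peak_freq = freqs[np.argmax(I[1:]) + 1]
```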

Due to the number of observations being used, the periodograms can appear somewhat cluttered if no transformations are applied. To increase readability, I smoothed the periodogram by applying a moving average to the spectral density values.
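A moving-average smoother of this kind can be sketched as follows (a hypothetical stand-in for `res.norm_smooth_periodogram`; the window size is arbitrary):

```python
import numpy as np

def smooth_periodogram(psd, window_size=8):
    """Smooth a periodogram with a simple moving average (Daniell-style kernel)."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(psd, kernel, mode="same")

# On a flat spectrum the smoother changes nothing away from the edges.
flat = np.ones(100)
smoothed = smooth_periodogram(flat)
```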

In [ ]:
psd, freq = res.periodogram_res(res_signals, sr=44100)
smooth_psd, smooth_freq = res.norm_smooth_periodogram(psd, freq, window_size=1024)

Correlation Errors¶

Analyze the correlation between the residuals and the target source. High correlation in specific regions might indicate systematic errors in your model.

In [ ]:
res.plot_res_acf(est_signals, true_signals)

Finally, to see whether each source's residuals correlate with the other sources at the frequencies where they overlap, we use a cross spectrum. A cross spectrum is an idea we learned in class that combines two previously established ideas: the cross-covariance function and the periodogram. Essentially, it shows us which frequencies two different time series do or do not share. Mathematically, a cross spectrum can be represented as:

$$f_{x,y}(\omega) = \sum_{h=-\infty}^{\infty}{\gamma_{x,y}(h) \cdot e^{-2\pi i\omega h}}$$

Where $\gamma_{x,y}(h)$ is the cross-covariance function.
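Given the definition above, the cross spectrum can be estimated with the cross-periodogram, $\text{FFT}(x)\,\overline{\text{FFT}(y)}/n$. A toy sketch with two series that share exactly one frequency:

```python
import numpy as np

n = 512
t = np.arange(n)
shared = np.sin(2 * np.pi * 30 * t / n)      # frequency both series share
x = shared + np.sin(2 * np.pi * 60 * t / n)  # x-only component
y = shared + np.sin(2 * np.pi * 90 * t / n)  # y-only component

# Cross-periodogram: FFT(x) * conj(FFT(y)) / n, an estimate of f_{x,y}(omega).
cross = np.fft.rfft(x) * np.conj(np.fft.rfft(y)) / n
mag = np.abs(cross)

# Only the shared 30-cycle component survives; the x-only and y-only
# frequencies multiply against (near-)zero coefficients and vanish.
```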

In [ ]:
for i in range(4):   
    res.source_csd(res_signals, smooth_psd, smooth_freq, i)

Looking at the ACF and PACF, we can tell the residuals are not white noise, so there is definitely some information not being fully captured by the masks. And that information is clearly periodic, which isn't surprising considering the data we are dealing with is music. I think in general we can assume there will always be some information missing from the model outputs. This shouldn't be a warning sign, as it can mean that frequencies we are incapable of even hearing may be present or absent in the output.
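The contrast the ACF reveals can be illustrated with a toy example (a sketch, not the `res` implementation): the sample ACF of a periodic "residual" stays large at multiples of its period, while white noise decays to near zero immediately.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[: len(x) - h] * x[h:]) / denom
                     for h in range(max_lag + 1)])

n = 200
sine = np.sin(2 * np.pi * 10 * np.arange(n) / n)       # period of 20 samples
noise = np.random.default_rng(4).standard_normal(n)

acf_sine = sample_acf(sine, 40)
acf_noise = sample_acf(noise, 40)
# acf_sine[20] stays close to 1 (one full period), while acf_noise[20]
# hovers near 0 -- the signature of white noise.
```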

In [ ]:
res.display_csd(est_signals, true_signals)
In [ ]:
print("Normal Signal")
ipd.display(ipd.Audio((stems[0]), rate=sr))

# A 19 Hz tone is below the audible range; dividing by the sample rate
# keeps the frequency at 19 Hz instead of aliasing to an audible one.
added_signal = np.sin(2*np.pi*19*np.arange(len(stems[0]))/sr)
mute_signal = stems[0] + added_signal

print("Normal Signal w Added Signal")
ipd.display(ipd.Audio((mute_signal), rate=sr))
Normal Signal
Normal Signal w Added Signal

Temporal Patterns¶

Check if there are temporal patterns in the residuals that could indicate specific time-based issues in your separation model, such as lag or phase issues.
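Phase error itself can be measured directly from the complex spectrograms. A minimal sketch with toy complex values, assuming a wrapped-angle definition (not necessarily what `res.calculate_phase_error` does):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two toy complex "STFT" frames: the estimate's phase is rotated slightly.
spec_true = rng.standard_normal(8) + 1j * rng.standard_normal(8)
spec_est = spec_true * np.exp(1j * 0.1)   # constant 0.1-radian phase shift

# Wrapped phase difference between the two spectra: the magnitudes cancel
# inside angle(), leaving only the phase rotation.
phase_err = np.angle(spec_est * np.conj(spec_true))
```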

In [ ]:
phase_spec, phase_est = res.get_phase(spec_data, s_hat)
phase_error = res.calculate_phase_error(phase_spec[1:], phase_est)
In [ ]:
res.plot_res_over_phase(phase_error, residuals)
In [ ]:
# # Enable the 'ipympl' backend
# %matplotlib widget

# res.display_interactive_plot_res(residuals, phase_error)

Quantitative Analysis¶

There are a few ways to look at residuals mathematically that we learned in class and that people frequently use in signal processing.

MSE and RMSE (Mean Squared Error and Root Mean Squared Error)¶

The MSE, closely related to the squared L2 norm, is maybe one of the most commonly used error/loss functions in statistics. Nonetheless, it's a good idea to examine the statistic to get a better understanding of the residuals. For this task, the MSE would be in the form:

$$\text{MSE or }L2_{time} = \frac{1}{KT}\sum_{t,k}|\hat{y}_{t,k} - y_{t,k}|^2$$
$$L1_{time} = \frac{1}{KT}\sum_{t,k}|\hat{y}_{t,k} - y_{t,k}|$$

Many modules have this functionality built in to calculate MSE or L1 norms.
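With plain NumPy, the two norms above reduce to one line each (toy residual values, not the report's actual data):

```python
import numpy as np

# A tiny hypothetical residual spectrogram (T x K = 2 x 2).
resid = np.array([[0.5, -0.5],
                  [1.0,  0.0]])

mse = np.mean(np.abs(resid) ** 2)  # MSE / L2-style error
l1 = np.mean(np.abs(resid))        # L1-style error
```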

In [ ]:
mse, me = res.calc_norms(residuals)
Drums: MSE=0.5322632539636895, l1=0.0059182979713163905
Bass: MSE=1.4550852278186694, l1=0.021212134586395287
Other: MSE=21.749803741009515, l1=0.06929522844168201
Vocals: MSE=5.187617823185366, l1=0.03445190099871528

SDR, SNR, SIR, and SAR (Signal to Distortion, Noise, Interference, and Artifacts Ratio)¶

These statistics are used in music source separation but also in many other applications of signal processing. Before defining them, it's probably best to introduce a few types of error in music source separation.

  • Noise: Also called white noise, this is a sequence of random, independently generated values with constant mean and variance that is added onto a signal. Noise will typically be a feature of any real-world recording, and represents random variation in a signal over time.

  • You can simulate noise in an audio signal very easily by adding a term to the signal data itself.

In [ ]:
print("Normal Signal")
ipd.display(ipd.Audio((stems[0]), rate=sr))
print("Signal w/ added white noise")
white_noise = np.random.normal(0,0.05,size=len(stems[0]))
ipd.display(ipd.Audio((stems[0] + white_noise), rate=sr))
Normal Signal
Signal w/ added white noise
  • Interference: In music source separation, interference refers to the amount of other, non-desired sources that appear in an estimated source. So, for example, if you heard the drums in the estimation for the bass source, that would be interference.

  • You can simulate this by simply adding a bit of one source's audio to another.

In [ ]:
print("Bass Signal")
ipd.display(ipd.Audio((stems[2]), rate=sr))
interference = stems[2] + (stems[1]*white_noise) # Adding the drums to the bass track, scaled point-by-point by the white noise.
print("Signal w/ interference")
ipd.display(ipd.Audio(interference, rate=sr))
Bass Signal
Signal w/ interference
  • Artifacts: Unwanted alterations in an audio signal; these typically arise from compression, transmission errors, or sloppy processing mistakes.

  • One type of artifact you can simulate is clipping, where the amplitude of an audio signal exceeds the maximum value the system can record.

In [ ]:
print("Normal Signal")
ipd.display(ipd.Audio((stems[0]), rate=sr))
clipped = np.clip(stems[0], -0.25, 0.25) 
print("Clipped Signal")
ipd.display(ipd.Audio(clipped, rate=sr))
Normal Signal
Clipped Signal
  • Distortion: An overall measure of the presence of these three unwanted effects in a digital audio signal.

We then use these error terms to come up with this interpretation of our estimated sources.

$$\hat{s_j} = s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}$$

Where $\hat{s_j}$ is our estimated source for source $j$, $s_{\text{target}}$ is our true source audio w/ allowed distortion*, $e_{\text{interf}}, e_{\text{noise}}, e_{\text{artif}}$ are the error terms for interference, noise and artifacts respectively.

*When I say allowed distortion essentially it just covers any difference in the estimated source that is not covered by these three error terms.

To calculate these terms we use these formulas:

Let $\Pi\{y_1,...,y_k\}$ denote an orthogonal projector onto the subspace spanned by $y_1,...,y_k$.

$$P_{s_j} = \Pi \{ s_j \}$$
$$P_{s} = \Pi \{ s_{j\prime} : 1 \le j\prime \le n \}, \text{ the span of all sources}$$
$$P_{s,n} = \Pi \{ s_{j\prime} : 1 \le j\prime \le n,\ (n_i)_{1\le i \le m} \}, \text{ where } n_i\text{ is the noise from the }i\text{th microphone}$$

And remember: $$\Pi \{ s_j \}y_k = \frac{ \langle y_k,s_j \rangle s_j}{\langle s_j,s_j \rangle} $$

Then use these for:

$$s_{\text{target}} = P_{s_j}\hat s_j$$
$$e_{\text{interf}} = P_{s}\hat s_j - P_{s_j}\hat s_j$$
$$e_{\text{noise}} = P_{s,n}\hat s_j - P_{s}\hat s_j$$
$$e_{\text{artif}} = \hat s_j - P_{s,n}\hat s_j$$
$$\text{So: } \hat s_j = (P_{s_j}\hat s_j) + (P_{s}\hat s_j - P_{s_j}\hat s_j) + (P_{s,n}\hat s_j - P_{s}\hat s_j) + (\hat s_j - P_{s,n}\hat s_j)$$

Finally the metrics are calculated with:

$$SDR = 10log_{10}\frac{||s_{\text{target}}||^2}{||e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}||^2}$$
$$SIR = 10log_{10}\frac{||s_{\text{target}}||^2}{||e_{\text{interf}}||^2}$$
$$SNR = 10log_{10}\frac{||s_{\text{target}} + e_{\text{interf}}||^2}{||e_{\text{noise}}||^2}$$
$$SAR = 10log_{10}\frac{||s_{\text{target}} + e_{\text{noise}} + e_{\text{interf}}||^2}{||e_{\text{artif}}||^2}$$
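A sketch of the decomposition in the simplest single-channel, noise-free case, where only $s_{\text{target}}$ and a single combined error term remain (toy signals, not the report's data or the full BSS eval machinery):

```python
import numpy as np

def proj(y, s):
    """Orthogonal projection of y onto the line spanned by s."""
    return (np.dot(y, s) / np.dot(s, s)) * s

rng = np.random.default_rng(3)
n = 1000
s = np.sin(2 * np.pi * 5 * np.arange(n) / n)  # true source
s_hat = s + 0.1 * rng.standard_normal(n)      # estimate with a small error

s_target = proj(s_hat, s)   # part of the estimate explained by the source
e_rest = s_hat - s_target   # everything else (interf + noise + artif)

# SDR: energy of the explained part over energy of the combined error.
sdr = 10 * np.log10(np.sum(s_target ** 2) / np.sum(e_rest ** 2))
```

By construction `e_rest` is orthogonal to `s_target`, which is what lets the energies be compared term by term.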

Note: This can get somewhat confusing, as SNR is also a commonly used metric in signal processing; however, this version differs from the usual one. The usual SNR is calculated as:

$$SNR = 10log_{10}\frac{||s_{j}||^2}{||\hat s_{j} - s_{j}||^2}$$
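This usual SNR is a one-liner; for a toy signal and estimate:

```python
import numpy as np

s = np.array([1.0, -1.0, 2.0, 0.5])   # toy true signal
s_hat = s + 0.1                       # estimate off by a constant 0.1

# SNR = 10*log10(||s||^2 / ||s_hat - s||^2)
snr = 10 * np.log10(np.sum(s ** 2) / np.sum((s_hat - s) ** 2))
```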
In [ ]:
import pandas as pd 

references = np.array(stems[1:]).T
drums_eval, bass_eval, other_eval, vocals_eval = [res.calc_evals(est_signals[i], references, i) for i in range(4)]

metrics = pd.DataFrame(data = [drums_eval, bass_eval, other_eval, vocals_eval], 
                       columns=["SISDR", "SISIR", "SISAR", 'SDSDR', "SNR", "SRR"],
                       index = ["Drums", "Bass", "Other", "Vocals"]
                       )

# Calculating the mean of each column
mean_row = metrics.mean()

# Converting the mean series to a DataFrame
mean_df = pd.DataFrame(mean_row).T

# Assigning a label for the index of the mean row
mean_df.index = ['Total Score']

# Appending the mean row to the original DataFrame
metrics = pd.concat([metrics, mean_df])

metrics
SISDR SISIR SISAR SDSDR SNR SRR
Drums 6.059788 12.621812 7.142728 6.823634 6.970171 14.714377
Bass 6.810572 12.789123 8.074067 7.477198 7.624050 14.694119
Other 9.564220 13.848223 11.590896 9.928236 10.019616 19.098387
Vocals 8.368302 14.999946 9.431727 8.844583 8.942912 18.426378
Total Score 7.700721 13.564776 9.059855 8.268413 8.389187 16.733315

For reference, a very good SDR score for a music source separator is about 6; however, the most state-of-the-art separators have reported SDR scores of up to 9.20!